Sparse solver on GPU #600
Conversation
@cameronrutherford: there is a Spack build failure for all the PRs/branches, see #601
@cnpetra it seems we are running out of space on PNNL platforms.
This looks very good and is a sound way to enable GPU-resident optimization with sparse linear algebra.
I am not sure if using inheritance CUSOLVERGPU -> CUSOLVER is the best approach here, even if this is just a temporary solution as @nychiang suggests. It will lead to a lot of duplicate code, which we will need to remove later on.
The current implementation of the cusolver-lu module already works with data on the device. It requires minimal modification to receive data in the memory space specified by HiOp. I suggest we work directly with the hiopLinSolverSymSparseCUSOLVER class; going through a derived class seems unnecessary.
This PR is supposed to only take care of GPU porting up to the linear solver(s). cuSolver was chosen somewhat arbitrarily, just to ensure the rest of the kernels run fine. The porting of the linear solvers is a bit more involved and is better done later on.
This PR looks great. How do you want to proceed re: PR #589? Which one should be merged first?
Maybe merge that one first. Also, PNNL CI is not passing.
I tried to remove all the member objects from RAJA, but it still fails on Marianas. Not sure which part causes the failure.
// swap M_ and M_host_ so that M_ points to the other copy of the KKT matrix
hiopMatrixSparse* swap_ptr = M_;
M_ = M_host_;
M_host_ = swap_ptr;
If the sparse matrix is implemented correctly, you really should not have to do this. The matrix class needs to provide both the device data and the host mirror, as well as methods to sync the two.
hiopMatrixSparse has only one member function, M(), to access its data, while hiopMatrixRajaSparseTriplet has accessors M() and M_host() for the data on device and host, respectively (see here).
However, the problem is that previously the sparse solver only worked in CPU and hybrid modes. It assumed that the KKT matrix M_ is always on the host and has type hiopMatrixSparseTriplet. Therefore, in the existing implementation of hiopLinSolverSparseCUSOLVER.cpp, there are many places that use the host data M_->M() and then transfer it to the device for hybrid usage (for example, see here). The current implementation cannot directly work with a matrix on the device.
If we want to use the existing hiopLinSolverSparseCUSOLVER without adding a new wrapper class, I need to dynamic_cast M_ to hiopMatrixRajaSparseTriplet first, sync the data, and change M_->M() to M_->M_host() in a couple of places. This seems like more work than adding a new wrapper class.
As a result, we think creating a new wrapper class minimizes the changes to your code and makes them easier to remove once the linear solver can support data on the device.
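For reference, a rough sketch of the alternative described above (not the approach taken in this PR); the header name and the need for an explicit sync step are assumptions, while the class and accessor names come from this thread:

#include "hiopMatrixRajaSparseTriplet.hpp"   // assumed header for the RAJA triplet class

// Reuse the existing hiopLinSolverSparseCUSOLVER by casting the KKT matrix to
// the RAJA sparse triplet type and reading its host mirror.
double* kkt_values_on_host(hiop::hiopMatrixSparse* M_)
{
  auto* M_raja = dynamic_cast<hiop::hiopMatrixRajaSparseTriplet*>(M_);
  if(M_raja == nullptr) {
    return nullptr;                           // matrix is not the RAJA (device) variant
  }
  // a device-to-host sync of the values would be needed here (exact API not shown),
  // after which M_host() replaces the M_->M() calls used by the existing solver
  return M_raja->M_host();
}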
The linear solver is already ported to GPU, so I am not sure what else needs to be done there. Regardless of that, creating a …
@pelesh As @cnpetra mentioned, right now I am working on GPU porting up to the linear solver(s).
Note that GPU sparse linear solvers need to set up and run the first factorization on the host before moving all data to the GPU. This is something the sparse HiOp module needs to account for as well, and it differs from MDS, where data can be moved to the GPU right after the initial setup. This is not specific to the sparse linear solver: for a sparse matrix-matrix product, you would also want to run the first iteration on the host to compute the sparsity pattern of the product there, and then reuse it for subsequent iterations on the GPU.
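For reference, here is a minimal sketch of that host-first workflow using cuSOLVER's refactorization API (cusolverRf). It assumes the first LU factorization (factors L and U with permutations P and Q) has already been computed on the host by a CPU solver such as KLU, and it omits error checking; it illustrates the pattern, not the code in this PR.

#include <cusolverRf.h>

// Host CSR data for the KKT matrix A and its host-computed factors L and U,
// plus the row/column permutations P and Q, are assumed to be available.
void refactorize_on_gpu(int n,
                        int nnzA, int* csrRowPtrA, int* csrColIndA, double* csrValA,
                        int nnzL, int* csrRowPtrL, int* csrColIndL, double* csrValL,
                        int nnzU, int* csrRowPtrU, int* csrColIndU, double* csrValU,
                        int* P, int* Q)
{
  cusolverRfHandle_t rf;
  cusolverRfCreate(&rf);

  // 1. Hand the host factorization (pattern + first numeric values) to cusolverRf.
  cusolverRfSetupHost(n,
                      nnzA, csrRowPtrA, csrColIndA, csrValA,
                      nnzL, csrRowPtrL, csrColIndL, csrValL,
                      nnzU, csrRowPtrU, csrColIndU, csrValU,
                      P, Q, rf);

  // 2. Symbolic analysis on the device; after this the sparsity pattern is fixed
  //    and subsequent iterations can stay entirely on the GPU.
  cusolverRfAnalyze(rf);

  // 3. Numeric refactorization on the GPU; in an IPM loop this step (preceded by
  //    cusolverRfResetValues with updated matrix entries) is repeated every iteration.
  cusolverRfRefactor(rf);

  cusolverRfDestroy(rf);
}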
MJacS[0] = 4.0;
MJacS[1] = 2.0;
// --- constraint 2 body ---> 2*x_1 + x_3
MJacS[2] = 2.0;
MJacS[3] = 1.0;
I am not sure this makes sense. Consider rewriting as:
if(itrow == 0) {
  MJacS[0] = 4.0;
  MJacS[1] = 2.0;
  // --- constraint 2 body ---> 2*x_1 + x_3
  MJacS[2] = 2.0;
  MJacS[3] = 1.0;
}
Actually, this is better added on the host and then copied to GPU, since these are just constant Jacobian elements that stay the same throughout the computation.
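A minimal sketch of that suggestion, assuming MJacS_dev points to a device buffer (the function and variable names are illustrative, not from the PR):

#include <cuda_runtime.h>

// Fill the constant Jacobian entries once on the host and copy them to the
// device buffer; they never change during the optimization.
void init_constant_jacobian(double* MJacS_dev)
{
  const double MJacS_host[4] = {4.0, 2.0, 2.0, 1.0};
  cudaMemcpy(MJacS_dev, MJacS_host, 4 * sizeof(double), cudaMemcpyHostToDevice);
}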
The if condition is not required here, since the RAJA::RangeSegment only contains one valid index.
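A minimal illustration of that point, with an assumed execution policy and a device-resident MJacS (not the exact kernel from this PR): with a single-element RangeSegment the lambda body runs exactly once, so an if(itrow == 0) guard is redundant.

#include <RAJA/RAJA.hpp>

void set_constant_jacobian(double* MJacS) // MJacS assumed to be a device pointer
{
  RAJA::forall<RAJA::cuda_exec<128>>(
    RAJA::RangeSegment(0, 1),                      // single valid index: itrow == 0
    [=] RAJA_DEVICE (RAJA::Index_type /*itrow*/)
    {
      MJacS[0] = 4.0;
      MJacS[1] = 2.0;
      // --- constraint 2 body ---> 2*x_1 + x_3
      MJacS[2] = 2.0;
      MJacS[3] = 1.0;
    });
}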
One missing piece in this PR is a flag that would inform HiOp when to move data to the device. Unlike in the dense linear algebra case, data is not moved to the GPU after the initial setup, but only after the sparsity patterns for the matrix-matrix products and the matrix factors are computed. The flag should be set to true once the HiOp sparse module and the linear solver have computed the sparsity patterns.
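Purely as an illustration of the suggested flag (the class, member, and method names below are hypothetical, not HiOp API):

class SparseKktGpuPort {
  bool data_on_device_ = false;          // true once all sparsity patterns are known
public:
  void firstFactorization() {
    // ... compute sparsity patterns of matrix-matrix products and factors on the host ...
    data_on_device_ = true;              // from now on, keep matrices and factors on the GPU
  }
  bool dataOnDevice() const { return data_on_device_; }
};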
This is pretty much how it should work in the long term. I am not sure that porting the initialization to the device will buy you much; it will likely be a performance loss. Most setup operations are not SIMD and perform far better on the CPU.
When mem_space is gpu, the linear solver class can/should create data copies on the CPU if it needs to. See for example: https://github.com/LLNL/hiop/blob/develop/src/LinAlg/hiopLinSolverCholCuSparse.cpp#L228
Such data movements should be handled by the linear solver class. And there are sparse linear solvers currently in HiOp that do work fully on the GPU (though under different options they may do CPU work, like computing the minimum fill-in of the factors in the Cholesky class; see the comment above).
This PR shows that the sparse solver can run under GPU mode, i.e., mem_space=device and compute_mode=gpu. I created a new example, NlpSparseRajaEx2, in this PR. We should remove the class hiopLinSolverSymSparseCUSOLVERGPU once we have a linear solver that can work with data on the device.
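A hypothetical driver fragment for that GPU mode (the option names appear in this thread; the interface object and the rest of the setup are assumptions modeled on HiOp's sparse example drivers, not the actual NlpSparseRajaEx2 code):

#include "hiopInterface.hpp"
#include "hiopNlpFormulation.hpp"
#include "hiopAlgFilterIPM.hpp"

int solve_on_gpu(hiop::hiopInterfaceSparse& nlp_interface)
{
  hiop::hiopNlpSparse nlp(nlp_interface);
  nlp.options->SetStringValue("mem_space", "device");   // keep NLP data in device memory
  nlp.options->SetStringValue("compute_mode", "gpu");   // run the IPM kernels on the GPU

  hiop::hiopAlgFilterIPMNewton solver(&nlp);
  hiop::hiopSolveStatus status = solver.run();
  return (status == hiop::Solve_Success) ? 0 : 1;
}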